Skip to content

fix: reward scaling, PPO clipping, ELA memory cap, and eval metrics#4

Merged
wniec merged 5 commits into
DAS2from
code-review
May 29, 2026
Merged

fix: reward scaling, PPO clipping, ELA memory cap, and eval metrics#4
wniec merged 5 commits into
DAS2from
code-review

Conversation

@Grzmro
Copy link
Copy Markdown
Collaborator

@Grzmro Grzmro commented May 19, 2026

No description provided.

@Grzmro Grzmro requested a review from wniec May 19, 2026 09:03
Comment thread agents/rl_das/network.py
nn.Linear(64, 1),
nn.ReLU(),
# No second ReLU: movement vectors are signed displacements.
# Clamping to >= 0 discards direction — the network cannot tell
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense

Comment thread das/env/bbob_splits.py Outdated
ALL_FUNCTIONS = set(range(1, 25))
INSTANCE_IDS = [1, 2, 3, 4, 5, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80]
EASY_TRAIN_FUNCTIONS = {4, *range(6, 15), 18, 19, 20, 22, 23, 24}
EASY_TRAIN_FUNCTIONS = {1, 2, 3, 4, *range(6, 15), 18, 19, 20, 22, 23, 24} # Czy tutaj nie powinny być też funkcje 1,2,3?
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"""
BBOB (F1-F24)

Difficulty Mode Training Set Testing Set
easy 4, 6-14, 18-20, 22-24 1, 2, 3, 5, 15, 16, 17, 21
difficult 1, 2, 3, 5, 15, 16, 17, 21 4, 6-14, 18-20, 22-24
"""

EASY_TRAIN_BBOB = {4, *range(6, 15), 18, 19, 20, 22, 23, 24}
ALL_FUNCTIONS = {_ for _ in range(1, 25)}
INSTANCE_IDS = [1, 2, 3, 4, 5, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80]
DIMENSIONS = [2, 3, 5, 10, 20, 40]

this is how it was handled in DynamicAlgorithmSelection 1. Easy refers here to sceanarion, not problem. I think that maybe naming could be changed here but not the easy/hard split. Or maybe we. could entirely omit it and focus only on random splits and CV?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry if it looked like I was screaming, accidentally inserted "#"

Comment thread das/env/bbob_splits.py Outdated
# random 2/3 – 1/3 split
all_ids = build_problem_ids(ALL_FUNCTIONS, dims)
rng = np.random.default_rng()
rng = np.random.default_rng(seed)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set_seed(args.seed) function should handle global seed setting. However I feel that it is good idea to use the same constant seed (0) to split train and test

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the other hand it is in fact change of logis from RLDAS and DAS1, so if it is not necessary now, I would leave it as it was.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the DAS training stack to improve reward scaling consistency, PPO stability, ELA feature computation efficiency/memory use, and the usefulness/reproducibility of evaluation metrics across BBOB problems.

Changes:

  • Adds configurable/cached ELA feature recomputation and caps stored ELA history to avoid large RAM growth.
  • Tightens PPO behavior (value clipping from first epoch; extra diagnostics logging; minor network/NaN-guard improvements; safer checkpoint loading).
  • Improves evaluation reporting by switching from raw best_y aggregation to gap-to-optimum metrics and by making final eval use a fresh env.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
train.py Adds CLI flag to control ELA recomputation frequency.
das/training/rldas.py Avoids mutating args, improves eval reproducibility, logs gap metrics, ensures results dir exists.
das/training/ppo.py Passes ELA recomputation config into SB3 env factory config.
das/training/common.py Threads ela_recompute_every through make_das_env config.
das/env/reward.py Fixes first-step reward scaling to avoid out-of-range values for “linear/sparse/binary” rewards.
das/env/observation.py Improves ELA robustness, reduces action-history feature cost, allows precomputed ELA injection.
das/env/das_env.py Adds ELA caching + history cap to reduce compute and memory overhead.
das/env/bbob_splits.py Adds deterministic seeding for random split; modifies easy-train function set.
agents/rl_das/trainer.py Uses public env accessor, adds PPO diagnostics logging, clarifies bootstrap handling.
agents/rl_das/network.py Removes invalid ReLU constraint for signed movement embeddings; adds NaN guard to critic.
agents/rl_das/env.py Exposes problem_ids via a public property.
agents/rl_das/agent.py Applies PPO value clipping from epoch 0; uses torch.load(..., weights_only=True) for safer loads.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread das/env/das_env.py Outdated
Comment thread das/env/bbob_splits.py Outdated
Comment on lines 25 to 26
def get_train_test_split(mode: str, dims: list[int], seed: int = 0) -> tuple[list[str], list[str]]:
"""Return (train_ids, test_ids) for the given split mode and dimensions.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true

Comment thread das/env/das_env.py Outdated
reward_option: int = 1,
n_individuals: int = 100,
seed: int | None = None,
ela_recompute_every: int = MAX_HISTORY_SAMPLE // 5 # ~500,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that caching ELA features is a good solution to reduce computations. Maybe some of them are unnecessary. I think, that better solution would be to omit some of them. I think that less than 10 (5-6) may be predictive. This is a good point to start and check their importances maybe. I'll send you models trained on the new portfolio I got, so we could analyse its behaviour and most important features to leave. Recently I also played with this: https://scikit-learn.org/stable/auto_examples/decomposition/plot_incremental_pca.html - may come to be very useful. Many ELA features are in general correlated. those are deleted here. However in DAS specific case it may be possible that other correlations still appear. IPCA could also tell us which features are correlated. We could incrementally fit it in the rollout filling phase when we also compute std and mean for rewards and state. This is just an idea (maybe crazy). In general I think that throwing out some features will be a better idea than caching them.

Comment thread das/env/observation.py Outdated
unique_idx = np.sort(unique_idx)
x = x[unique_idx][-MAX_HISTORY_SAMPLE:]
y = y[unique_idx][-MAX_HISTORY_SAMPLE:]
# Slice to the most-recent samples first; deduplication is done below
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deduplication is really a thing for ELA features computation to work. Pyflacco breaks if some x appears twice xD. This happens very rarely I think. Maybe in later phases when population collapses into a one point. but I think that an option (alternative to the one mentioned in the other comment) would be implementation of some early stopping mechanism. We detect 0 (or close to 0) variance in population -> we stop with our computation. The issue here might be that with incosistent optimization length, metrics such as AOCC stop making sense. What do you think @WojtAcht ?

Comment thread das/env/reward.py
return (old_best_y - new_best_y) / (scale + 1e-10)


def reward_log_scaled(new_best_y, old_best_y, initial_range, is_final=False):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's the issue with the reward you mentioned earlier. 0.0 in the beginning is certain to be hacked. Agent would choose worst possible optimizer first so the subsequent ones can make the biggest improvement (as nothing is optimized) - this happened to us before. Implementation of logarithm at least forces to choose most exploratory one, so difference between best and worst y is big. This is also going to be hacked, but hopefully in a less painful way. As you mentioned earlier, reward choice is both important and very hard. Maybe it would be nice to add more options for reward?

Comment thread das/training/rldas.py Outdated
# get_portfolio already raises for unknown names; this guard catches an
# empty --portfolio list before anything expensive is initialised.
raise ValueError(
"Portfolio is empty — specify at least one optimizer name via --portfolio."
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice validation

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified Idea a bit. Now RLDAS should have empty portfolio parameter as it should only use the original paper's one.

Comment thread das/env/reward.py
"""Binary: 1 if improvement >= 0.1%, else 0 (original r4)."""
if old_best_y == float("inf"):
return float(np.log(initial_range[1] - initial_range[0] + 1e-10))
return 0.0
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is important change of behaviour. I think further portfolio study will be needed. for now I'll leave the reward as it was before

Comment thread das/env/observation.py
[choices_history.count(j) for j in range(n_actions)], dtype=np.float32
)
# O(n) instead of O(n_actions * n_steps) from calling list.count in a loop.
counts = np.bincount(choices_history, minlength=n_actions).astype(np.float32)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

@wniec wniec merged commit d014d6a into DAS2 May 29, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants